CERENKOV3: Clustering and molecular network-derived features improve computational prediction of functional noncoding SNPs.

Identification of causal noncoding single nucleotide polymorphisms (SNPs) is important for maximizing the knowledge dividend from human genome-wide association studies (GWAS). Recently, diverse machine learning-based methods have been used for functional SNP identification; however, this task remains a fundamental challenge in computational biology. We report CERENKOV3, a machine learning pipeline that leverages clustering-derived and molecular network-derived features to improve prediction accuracy of regulatory SNPs (rSNPs) in the context of post-GWAS analysis. The clustering-derived feature, locus size (number of SNPs in the locus), derives from our locus partitioning procedure and represents the sizes of clusters based on SNP locations. We generated two molecular network-derived features from representation learning on a network representing SNP-gene and gene-gene relations. Based on empirical studies using a ground-truth SNP dataset, CERENKOV3 significantly improves rSNP recognition performance in AUPRC, AUROC, and AVGRANK (a locus-wise rank-based measure of classification accuracy we previously proposed).


Introduction The rSNP detection problem
Genome-wide association studies (GWAS) are increasingly being used to map the genes that underlie human polygenic traits. GWAS have uncovered variant-to-trait associations in thousands of studies collectively involving millions of individuals. 1 Functional interpretation of genetic loci identified through GWAS has primarily focused on coding regions in which SNPs can be explained based on amino acid changes; 2 however, 90% of human GWAS-identified SNPs are located in noncoding regions 3 from which it is difficult to pinpoint the regulatory SNP (or rSNP) that is causal for trait variation. 4 With the tremendous increase in genomic and functional genomic datasets, computational data-driven approaches have become a mainstay of functional rSNP prioritization, although the rSNP identification problem remains fundamentally challenging. While some unsupervised approaches, which do not involve training based on an example set of experimentally validated rSNPs, have been proposed, [5][6][7][8][9][10][11][12][13][14][15] evidence from our work 16,17 and others' [18][19][20][21] suggests that supervised approaches in general have superior rSNP detection accuracy. In addition, the growth of literature-curated databases of experimentally validated rSNPs [22][23][24] has stimulated the development of supervised approaches. A variety of supervised classification algorithms have been proposed, including the SVM, 11,13,18,25 naïve Bayes, 26 ensemble decision tree algorithms, 19,20,27 probabilistic graphical models, 12,28 deep neural networks, 14,21,29 weighted sum of feature ranks, 30 and our work using regularized gradient boosted decision trees 16,17 and deep residual networks. 31 Recently, several hybrid methods have been proposed such as combining recurrent and convolutional neural networks 21 and combining deep neural networks with regularized gradient boosted decision trees. 29 In addition to binary classification approaches, regression-based approaches have been used for the rSNP detection problem. 32,33 For rSNP detection, as with other machine-learning problems, the features (in this case, SNP features) are as important as the classification algorithm. Consequently, various types of SNP annotations that correlate with functional rSNPs have been used, 34 for example, phylogenetic sequence conservation and expression quantitative trait locus (expression QTL, or eQTL) association 35 scores. Furthermore, studies have shown that increasing the diversity of SNP annotation features improves rSNP detection, and thus there has been a steady increase in the number of features used in machine-learning approaches for this problem. 10,[18][19][20][21]26,27,29 In our previous work 17 we reported a model (Computational Elucidation of the REgulatory NonKOding Variome, CERENKOV) with a 246-dimensional feature space that clearly outperformed some models 20,21,29 with significantly higher-dimensional feature spaces. This suggested that feature correlation within, and sparsity of, high-dimensional feature space may weaken the improvement of rSNP detection accuracy. Therefore, how to identify and integrate various types of rSNP correlates remains a key challenge for accurate rSNP detection.

Our previous CERENKOV methods
Our previous classifier, CERENKOV, 17 had four key innovations. First, we selected a reference set of SNPs to represent noncoding loci that would be expected to be encountered in a post-GWAS analysis, based on population minor allele frequency. 17 Second, we used a regularized gradient boosted decision tree (XGBoost) classification algorithm, 36 which we found has superior rSNP recognition performance to Random Forest and Kernel SVM. Third, we engineered 246 SNP-level features from phylogenetic, genomic, epigenomic, chromatin structural, cistromic, population genetic, replication-timing, and functional genomic datasets. Fourth, CERENKOV incorporated a locus-wise rank-based measure of classification accuracy, AVGRANK, 17 which more realistically models the costs associated with incorrect predictions in post-GWAS analysis than typical measures like area under the receiver operating characteristic curve (AUROC) or area under the precision-recall curve (AUPRC). We compared the accuracy of CERENKOV to nine previously published rSNP recognition models 17 and found that CERENKOV's performance significantly improved upon the nine other models, by AUPRC, AUROC, and AVGRANK.
More recently, we reported on CERENKOV2, 16 which improved performance over CERENKOV by leveraging insights into the data-space geometry of the problem. In addition to using a 2.5-fold expanded reference set of SNPs (the OSU18 SNP set which has 39,083 SNPs for model benchmarking), we incorporated new features that are based on likelihood ratios of average SNP-to-neighboring-SNPs distances for various types of distance measures. By taking account geometric properties of the distribution of SNPs in data space, CERENKOV2 achieved significantly better rSNP recognition performance than CERENKOV and (as with CERENKOV) it outperformed the next-strongest rSNP detection tool, GWAVA. 19

CERENKOV3: new clustering-derived and network-derived features
Clustering: Clustering is a widely used technique for statistical data analyses. In GWAS, SNP clustering can help detect groups of similar SNPs that are amenable to classification using group-wise models. 37 To find useful SNP partitions, one option is to use specific domain knowledge to group the SNPs, for example by target genes and/or functional pathways. Another option is to use hierarchical clustering methods which rely on a distance measure between the SNPs. For example, SNPs can be clustered based on their pairwise relation given by stagewise regression coefficients using average linkage and the result of clustering helps alleviate the dimensionality problem when training deep Boltzmann machines. 38 Therefore, we hypothesized that deriving a feature to explicitly account for SNP clustering could improve performance for rSNP detection in the context of supervised classification.
Molecular network: Molecular networks, especially gene regulatory networks (GRN), are also key to the rSNP detection problem because molecular networks mediate the effects exerted by rSNPs on trait population variation. Therefore we hypothesized that mapping SNPs to a molecular network, and deriving features from the vantage points of SNPs in the network, would benefit rSNP detection. Construction of such a network is greatly aided by the recent availability of datasets from studies of direct DNA contacts utilizing assays such as Hi-C or chromatin interaction analysis by paired-end tag sequencing (ChIA-PET). Moreover, molecular networks have been successfully used to improve the inference accuracy of causal coding variants. [39][40][41][42] However, although biologically intuitive, the complex interactions reflected by the underlying GRNs in which noncoding rSNPs take effect, namely, interactions among transcription factors and their target genes, are largely not taken into account in existing algorithms for functional SNP identification. 43 We endeavored to capture such network-contextual information as new SNP-level features in CERENKOV3.
CERENKOV3: CERENKOV3 takes advantage of newly engineered features reflecting SNP clustering and SNP network context and thereby improves rSNP prediction performance over our previous approaches, CERENKOV 17 and CERENKOV2. 16 As in our previous approaches, in CERENKOV3 we use regularized gradient boosted decision trees (XGBoost) as the base classifier due to its superior speed and performance. We combined our original 248-dimensional feature matrix with two new types of features derived from clustering and molecular networks, respectively: (1) "locus size", a static feature based on the number of SNPs within a locus; and (2) a pair of dynamic features extracted by node2vec, 44 an algorithm for learning continuous feature representations for nodes based on network random walks. We constructed a weighted molecular network using the the data sources 4DGenome, 45 Encyclopedia of DNA Elements (ENCODE), and Genotype Tissue Expression (GTEx) for SNP-gene connections; and BioGRID, 46 Coexpedia 47 and HumanNet 48 for gene-gene connections. We treated the edge weights as hyperparameters that we tuned in the method (see Methods).

Reference SNP set and annotation features
A set of experimentally validated SNPs is fundamental to our supervised-learning method. In this work, we used the OSU18 SNP set that we first used in CERENKOV2 16 and that is specifically designed to represent the computational task of post-GWAS rSNP identification. Loci are partitioned naturally by our filtering procedure when choosing SNPs: we only included SNPs within 50 kbp of an rSNP; this partitioning scheme also guarantees the possibility of locus sampling, 17 a group-wise sampling technique that we implemented to assign SNPs to cross-validation (CV) folds by locus. After we analyzed the repeated locussampling based CV performance, we found that, whichever fold it was assigned, the locus with ID chr5_30 (internally meaning the 30th locus on chromosome 5) would hinder the validation performance. From the OSU18 SNPs we pruned one locus (chr5_30) because it contained an anomalously high number of rSNPs (143) that lack supporting documentation in the source database (ClinVar). With that exclusion, the overall class balance of the remaining 38,795 OSU18 SNPs is ~15.26 (ratio of control SNPs, or cSNPs, to rSNPs). As our baseline set of features, we obtained the 248 SNP annotation features from the CERENKOV2 16 feature pipeline. For comparison purposes, we also extracted 175 SNPlevel features from the GWAVA 19 software.

Clustering-derived feature: locus sizes
As mentioned above, when collecting the negative examples, i.e. the cSNPs, we only chose those that were in strong linkage disequilibrium (r 2 ≥ 0.8) with, and located no more than 50 kbp distance from, an rSNP. This procedure provided us a natural way of clustering: we first sorted all the OSU18 SNPs by their locations per chromosome. Then, for any pair of neighboring SNPs on the same chromosome, if the distance between their chromosome positions is greater than 50 kbp, we divided them into two separate loci. In this way, we partitioned the OSU18 SNP set into 1290 loci, each containing approximately 30 SNPs on average. Then for each SNP, the number of SNPs within its locus is computed as the "locus size" feature.

Construction of molecular networks
We constructed an undirected SNP-proximal molecular network for CERENKOV3 as follows (Fig. 1): First, vertex types were limited to SNPs and genes in order to reduce constructional and computational overhead. Second, we integrated eight data sources of SNP-gene interactions (five interaction types using data from four sources, namely, 4DGenome, 45 GTEx, Ensembl, and ENCODE) and gene-gene interactions (three sources) in order to maximize the connectedness of the network. Third, under the premise that different data sources are likely to have different degrees of relevance/informativeness for the rSNP prediction task, we assigned edges numerical weights according to the relation types they represented (see colored edges in Fig. 1); we treated these weights as hyperparamters of our classifer that we tuned empirically to maximize performance (see Sec. Machine learning pipeline and hyperparameter tuning). To construct the CERENKOV3 network, we used as vertices the pruned OSU18 SNP set and all human genes from Ensembl (release GRCh37). We mapped Ensembl gene IDs to NCBI IDs using BioMart as needed for integrating genegene interaction data sources.
Detailed procedure for obtaining SNP-gene edges:

•
For any single SNP s, if among all candidate gene vertices there is a gene g whose transcription start site (TSS) lies closest downstream to that gene, we drew an edge between s and g. We call this a "nearest-gene" SNP-gene relation, as it is based on SNP-TSS proximity.
• 4DGenome is a public database of chromatin interaction records that contains over three million human chromatin interactions curated from a comprehensive collection of 3C, 4C, 5C, ChIA-PET, Hi-C and IM-PET 49 studies. If a SNP s and the TSS of a gene g exclusively located in two interacted regions reported by 4DGenome, we added a s-g connection in the network. Furthermore, for any gene g, we defined the promoter region to span the range from 2000 bp upstream to 500 bp downstream of its TSS. Similarly, for any pair of interacted regions reported by 4DGenome, if the genomic region contains a SNP s and partially overlaps with a gene g's promoter section exclusively, such an s-g edge will also be included in our CERENKOV3 network. We call these two types of relations 4DGt (for TSS proximity) and 4DGp (for promoter proximity), respectively.
• GTEx is a comprehensive public resource to study tissue-specific gene expression and regulation. GTEx defines "eGenes" as genes with at least one SNP in cis significantly associated, at a false discovery rate (FDR) of ≤ 0.05, with expression differences of that gene. We used single-tissue cis-eQTL data from GTEx Analysis V7 and we incorporated all SNP-eGene associations into our CERENKOV3 network as edges.

•
The last set of SNP-gene edges were obtained from connections through overlapping transcription factor binding sites (TFBS) using the UCSC Genome Browser and MyGene.info application programming interface (API). First, we used an inner join between the All SNP (build 146) and Transcription Factor ChIP-seq Clusters V3 tables of the GRCh37 assembly from the UCSC Genome Browser to obtain all TFBS symbols overlapping with our pruned OSU18 SNP set; then we used MyGene.info API in order to find all genes that are translated into the corresponding transcription factors.

Detailed procedure for obtaining gene-gene edges:
We directly obtained gene-gene edges from BioGRID, 46 Coexpedia 47 and HumanNet. 48

•
BioGRID is an online biological interaction repository with data compiled through comprehensive curation efforts. We used version 3.5.171 to extract all gene-gene pairs which participates in the interactions reported and augmented our network.
• Coexpedia and HumanNet (v2) are two gene co-expression databases and serve as a natural source of gene-gene edges.

Network-derived features
Once the molecular network was constructed and a set of edge weight hyperparameters assigned (within the context of a hyperparameter search algorithm), we used node2vec 44 to extract low-dimension continuous representations for each network vertex. Specifically, through a set of parameters controlling the usage of breadth-first and depth-first searches, node2vec provides a way of balancing the exploration-exploitation tradeoff when generating random walks for each vertex. Once the random walks are completed, node2vec calls word2vec, 50 a word embedding algorithm, to generate embeddings on the string representations of the random walks. The dimension of node2vec's output, i.e., the number of network-derived features, is not determined in advance; instead we optimized it within the hyperparameter search.

Machine learning pipeline and hyperparameter tuning
As shown in Fig. 2, the pipeline of CERENKOV3 includes three major steps:

1.
The unweighted network is saved in edge-list format and then assigned weights dynamically according to the types of edges, i.e., the types of SNP-gene or genegene relations.

2.
The weighted network is sent to node2vec and the embeddings are generated and output as new features for the classifier.

3.
The network-derived features and clustering-derived feature, locus sizes, are integrated with the baseline 248-dimension SNP features. The combined feature matrix is input to the XGBoost classifier within the context of a replicated, locus sampling-based, five-fold cross-validation training process, with performance measures obtained on the validation sets.
The whole pipeline is wrapped into a custom scikit-learn estimator object, whose three sets of hyperparameters are as follows:

2.
The hyperparameters of node2vec, including d (the number of output dimensions), r (the number of random walks for each vertex), l (the length of each random walk), k (the context window size when calling word2vec), p (the "return" degree, controlling the probability to go back to the visted vertex) and q (the "inout" degree, controlling the probability to explore undiscovered parts of the network).

3.
The hyperparameters of XGBoost as below: max_depth (the maximum tree depth for base learners), learning_rate, n_estimators (the number of trees to fit), gamma (the minimum loss reduction required to make a further partition on a leaf node of the tree), subsample (subsample ratio of the training instance) and colsample_bytree (the subsample ratio of columns when constructing each tree).
Considering high dimensionality (20) of the hyperparameter space, we used a random search method 51 to approximate the optimal configuration. The random search works on the assumption that 1% of the hyperparameter configuratoins will lead to close-to-optimal performance. Based on this assumption, with n ≥ 240 trials, we would expect to find a closeto-optimal configuration with a high probability of 1 − (1 − 0.01) n > 0.99.
For the CERENKOV3 machine learning pipeline, we used a combination of bash, bedtools (v2.25.0), the R statistical computing environment (version 3.4.4), scikit-learn (version 0.21.2) and Python 3.5.2, all under Ubuntu 16.04. In addition, for the purpose of comparison, we also generated features for the pruned OSU18 SNP set with the GWAVA 19 program and then applied Random Forest algorithm with R package ranger version 0.6.0 with the published hyperparameters. To make a fair comparison, we adapted the same crossvaliation settings, fold assignments, and performance measurements for all classifiers.

Random search in hyperparameter space
We carried out a 240-trial random search on hyperparameters with XGBoost on a basis of ten-fold replicated, locus sampling-based, five-fold cross-validation and esetimated the optimal hyperparameters as shown below.
For edge weights, we set the options for random search within a set of real numbers {0.0, 0.1, 0.3, 1.0, 3.0}. The optimized edge weights appear to emphasize snp-gene promoterproximity edges through 4DGenome data and gene-gene edges from BioGRID. In terms of node2vec parameters, we found in general that lower dimensions of output (d) and longer distances of random walks (l) perform best. In addition, the optimal combination of p = 4 and q = 8 means probablistically in our constructed network, more breadth-first searches were carried out than depth-first ones in the optimal configuration.

Analysis of newly engineered features
For each SNP, we obtained the locus size and the optimal two-dimensional embedding returned by node2vec. We first analyzed these three new features for the two SNP classes (rSNPs and cSNPs) using kernel density estimation for feature values conditioned on the class label (rSNP or cSNP) of the reference SNP. As seen in Fig. 3, there are evident likelihood differences (particularly reflecting differences in the skewness and kurtoses of the distributions) that could be exploited by XGBoost. For locus sizes, the feature distribution for rSNPs are slightly more left-shifted and more leptokurtic than the distributions for cSNPs; in terms of the first network-derived feature, the distribution for rSNPs is more shifted to the right; for the second network-derived feature, both of the distributions are more leptokurtic than those of the first features and similarly the distribution for rSNPs are more right-shifted.

Comparison of performance
Using the above-described optimal hyperparameters and cross-validation framework, we compared the performance of GWAVA, CERENKOV, CERENKOV2, and CERENKOV3 in terms of AUPVR, AUROC, and AVGRANK (Fig. 4). CERENKOV3 was the bestperforming algorithm overall, significantly outperforming GWAVA (which was the bestperforming of the nine competing algorithms in our previous study 17

Conclusion and discussion
We have demonstrated, using side-by-side comparisons on identical assignments of SNPs to cross-validation folds, that CERENKOV3's performance exceeds that of our previous CERENKOV, by both classical global rank-based measures (AUPRC and AUROC) and by the GWAS-oriented performance measure, AVGRANK. In particular, CERENKOV3's validation-set AUPRC performance, 0.459, is a significant improvement over CERENKOV2's AUPRC of 0.418 on the same pruned reference SNP set. These results reveal CERENKOV3's ability, by virtue of its novel features based on clustering and molecular networks, to contribute to solving the problem of identifying candidate causal noncoding SNPs in GWAS summary regions.
We anticipate that CERENKOV3's performance may be further improved through several possible enhancements. An appealing extension would be to combine deep neural networkbased approaches based on the local 1 kbp sequence haplotype (recognizing that the local haplotype provides important correlates of functional SNP alleles 52 ), with CERENKOV3's current set of 251 SNP features. Our previous work 31 has demonstrated that a classifier (Res2s2am) based on a deep residual network architecture has state-of-the-art performance on the related problem of discriminating trait-associated noncoding SNPs from control noncoding SNPs. Another direction of improvement is to continue feature engineering from clustering and networks. For example, currently, graph neural networks (GNN) are showing promise for integrating the SNP annotaion features and the connections between them. With GNN, it is possible to carry out representation learning on annotation features through graph embedding. The complete source code for CERENKOV3 is publicly available under an open-source license via GitHub at https://github.com/ramseylab/cerenkov3. Data sources and types of relations used in to construct CERENKOV3 network.
Yao and Ramsey